The American Community Survey is a survey run by the US Census Bureau that collects data on everything from the affordability of housing to employment rates for different industries. For this experiment, I will be using data derived from the American Community Survey for the years 2010-2012. The team at FiveThirtyEight has cleaned the dataset and made it available on their GitHub repo.
Here's a quick overview of the files I'll be working with:
all-ages.csv - employment data by major for all ages
recent-grads.csv - employment data by major for just recent college graduates
By completing this challenge, I will test my comfort with Pandas for manipulating DataFrames and calculating summary statistics.
In [1]:
import pandas as pd
all_ages = pd.read_csv("all-ages.csv")
all_ages.head(5)
Out[1]:
In both of these datasets, majors are grouped into categories. As you may have noticed, there are multiple rows with a common value for Major_category but different values for Major. We would like to know the total number of people in each Major_category for both datasets.
I will use the Total column to calculate the number of people who fall under each Major_category and store the result as a separate dictionary for each dataset. The key for the dictionary should be the Major_category and the value should be the total count. For the counts from all_ages, store the results as a dictionary named all_ages_major_categories and for the counts from recent-grads, store the results as a dictionary named recent_grads_major_categories.
In [29]:
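all_ages = pd.read_csv("all-ages.csv")
# Sum Total within each Major_category, largest first (DataFrame.sort was removed from pandas; sort_values replaces it)
all_ages_totals = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].sort_values(ascending=False)
# Store the counts as a dictionary keyed by Major_category
all_ages_major_categories = all_ages_totals.to_dict()
all_ages_totals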
Out[29]:
In [30]:
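recent_grads = pd.read_csv("recent-grads.csv")
# Sum Total within each Major_category, largest first (DataFrame.sort was removed from pandas; sort_values replaces it)
recent_totals = recent_grads.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].sort_values(ascending=False)
# Store the counts as a dictionary keyed by Major_category
recent_grads_major_categories = recent_totals.to_dict()
recent_totals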
Out[30]:
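The task above asks for a dictionary keyed by Major_category. As a minimal sketch on made-up data (the toy DataFrame and its counts are invented for illustration and are not taken from the ACS files), grouping on Major_category and calling to_dict() on the summed Total column produces that mapping:

```python
import pandas as pd

# Toy stand-in for all-ages.csv; rows and counts are invented for illustration.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "Total": [100, 50, 30, 20],
})

# Sum Total within each category, then convert the resulting Series to a dict.
toy_major_categories = toy.groupby("Major_category")["Total"].sum().to_dict()
print(toy_major_categories)
```

The same two calls should work unchanged on the real DataFrames, assuming they contain the Major_category and Total columns described above.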
The press likes to talk a lot about how many college grads are unable to get higher wage, skilled jobs and end up working lower wage, unskilled jobs instead. As a data person, it is your job to be skeptical of any broad claims and explore if you can acquire and analyze relevant data to obtain a more nuanced view. Let's run some basic calculations to explore that idea further.
I will use the Low_wage_jobs and Total columns to calculate the proportion of recent college graduates that worked low wage jobs. Store the resulting Float object of the calculation as low_wage_percent.
In [19]:
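recent_grads = pd.read_csv("recent-grads.csv")
# Number of recent grads working low-wage jobs, summed over all majors
low_wage_sum = float(recent_grads["Low_wage_jobs"].sum())
# Note: the denominator is the Employed column, so this is the share of employed recent grads
recent_sum = float(recent_grads["Employed"].sum())
low_wage_percent = low_wage_sum / recent_sum
low_wage_percent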
Out[19]:
So it looks like 12.3 percent of new grads are working in low-wage jobs.
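The choice of denominator matters for this kind of figure: dividing by Employed gives the share of employed graduates in low-wage jobs, while dividing by Total gives the share of all graduates. A small sketch on invented numbers (these are not the ACS figures; toy_grads is a hypothetical stand-in) shows how the two can differ:

```python
import pandas as pd

# Invented counts for two hypothetical majors; not real ACS data.
toy_grads = pd.DataFrame({
    "Total": [1000, 500],
    "Employed": [800, 400],
    "Low_wage_jobs": [100, 50],
})

low_wage = toy_grads["Low_wage_jobs"].sum()

# Share of all graduates in low-wage jobs.
share_of_all = low_wage / toy_grads["Total"].sum()
# Share of employed graduates in low-wage jobs.
share_of_employed = low_wage / toy_grads["Employed"].sum()

print(share_of_all, share_of_employed)
```

Because Employed can never exceed Total, the second ratio is always at least as large as the first.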
Both all_ages and recent_grads datasets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two datasets and perform some initial calculations to see how similar or different the statistics of recent college graduates are from those of the entire population.
We want to know the number of majors where recent grads fare better than the overall population. For each major, determine if the Unemployment_rate is lower for recent_grads or for all_ages and increment either recent_grads_lower_emp_count or all_ages_lower_emp_count respectively.
In [41]:
# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index
recent_grads_lower_emp = []
all_ages_lower_emp = []
for major in majors:
    recent_unemply_rate = recent_grads[recent_grads["Major"] == major]["Unemployment_rate"].values[0]
    all_time_unemply_rate = all_ages[all_ages["Major"] == major]["Unemployment_rate"].values[0]
    diff = recent_unemply_rate - all_time_unemply_rate  # comparator
    if diff < 0:
        recent_grads_lower_emp.append(major)
    elif diff > 0:
        all_ages_lower_emp.append(major)
    else:
        pass  # equal
In [42]:
len(recent_grads_lower_emp)
Out[42]:
In [43]:
len(all_ages_lower_emp)
Out[43]:
So it looks like new grads have a lower unemployment rate than the overall population in only 43 of the 173 majors. That fits the old adage that experience is key in the job search. Let's take a look at which majors favor new grads:
In [44]:
recent_grads_lower_emp
Out[44]:
In [45]:
all_ages_lower_emp
Out[45]:
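The per-major loop above can also be written as a single join. The following is a sketch on invented rates (toy_recent and toy_all are hypothetical stand-ins for recent_grads and all_ages, and the numbers are not from the ACS files): merge the two frames on Major, then filter with boolean masks.

```python
import pandas as pd

# Invented unemployment rates for three hypothetical majors.
toy_recent = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.08, 0.06],
})
toy_all = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.06, 0.07, 0.06],
})

# Line up the two rates for each major side by side.
merged = toy_recent.merge(toy_all, on="Major", suffixes=("_recent", "_all"))

recent_rate = merged["Unemployment_rate_recent"]
all_rate = merged["Unemployment_rate_all"]

# Majors where each group has the strictly lower rate; ties land in neither list.
recent_lower = merged.loc[recent_rate < all_rate, "Major"].tolist()
all_lower = merged.loc[recent_rate > all_rate, "Major"].tolist()

print(recent_lower, all_lower)
```

This avoids filtering each DataFrame once per major, and it treats ties the same way as the loop does.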